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Abstract 

Clustering in high-dimensional spaces is nowadays a recurrent problem in many sci- 
entific domains but remains a difficult task from both the clustering accuracy and the 
result understanding points of view. This paper presents a discriminative latent mixture 
(DLM) model which fits the data in a latent orthonormal discriminative subspace with 
an intrinsic dimension lower than the dimension of the original space. By constraining 
model parameters within and between groups, a family of 12 parsimonious DLM models 
is exhibited which allows to fit onto various situations. An estimation algorithm, called 
the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters 
and the discriminative subspace. Experiments on simulated and real datasets show that 
the proposed approach performs better than existing clustering methods while providing 
a useful representation of the clustered data. The method is as well applied to the clus- 
tering of mass spectrometry data. 

Keywords: high-dimensional clustering, model-based clustering, discriminative sub- 
space. Fisher criterion, visualization, parsimonious models. 



1 Introduction 

In many scientific domains, the measured observations are nowadays high-dimensional and 
clustering such data remains a challenging problem. Indeed, the most popular clustering meth- 
ods, which are based on the mixture model, show a disappointing behavior in high-dimensional 
spaces. They suffer from the well-known curse of dimensionality [6] which is mainly due to the 
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fact that model-based clustering methods are over-parametrized in high-dimensional spaces. 
Furthermore, in several applications such as mass spectrometry or genomics, the number of 
available observations is small compared to the number of variables and such a situation 
increases the difficulty of the problem. 

Hopefully, since the dimension of observed data is usually higher than their intrinsic dimen- 
sion, it is theoretically possible to reduce the dimension of the original space without loosing 
any information. Therefore, dimension reduction methods are traditionally used before the 
clustering step. Feature extraction methods such as principal component analysis (PCA) or 
feature selection methods are very popular. However, these approaches of dimension reduction 
do not consider the classification task and provide a sub-optimal data representation for the 
clustering step. Indeed, dimension reduction methods imply an information loss which could 
be discriminative. Only few approaches combine dimension reduction with the classification 
aim but, unfortunately, those approaches are all supervised methods. Fisher discriminant 
analysis (FDA) (see Chap. 4 of |28j ) is one of them in the supervised classification framework. 
FDA is a powerful tool for finding the subspace which best discriminates the classes and re- 
veals the structure of the data. This subspace is spanned by the discriminative axes which 
maximize the ratio of the between class and the within class variances. 

To avoid dimension reduction, several subspace clustering methods have been proposed 
in the past few years to model the data of each group in low-dimensional subspaces. These 
methods turned out to be very efficient in practice. However, since these methods model 
each group in a specific subspace, they are not able to provide a global visualization of the 
clustered data which could be helpful for the practitioner. Indeed, the clustering results of 
high-dimensional data are difficult to understand without a visualization of the clustered 
data. In addition, in scientific fields such as genomics or economics, original variables have an 
actual meaning and the practitioner could be interested in interpreting the clustering results 
according to the variable meaning. 

In order to both overcome the curse of dimensionality and improve the understanding 
of the clustering results, this work proposes a method which adapts the traditional mixture 
model for modeling and classifying data in a latent discriminative subspace. For this, the pro- 
posed discriminative latent mixture (DLM) model combines the model-based clustering goals 
with the discriminative criterion introduced by Fisher. The estimation procedure proposed 
in this paper and named Fisher-EM has three main objectives: firstly, it aims to improve 
clustering performances with the use of a discriminative subspace, secondly, it avoids estima- 
tion problems linked to high dimensions through model parsimony and, finally, it provides a 
low-dimensional discriminative representation of the clustered data. 

The reminder of this manuscript has the following organization. Section [2] reviews the 
problem of high-dimensional data clustering and existing solutions. Section [3] introduces the 
discriminative latent mixture model and its submodels. The link with existing approaches is 
also discussed in Section [3l Section |4] presents an EM-based procedure, called Fisher-EM, for 
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estimating the parameters of the DLM modeh Initiahzation, model selection and convergence 
issues are also considered in Section 21 In particular, the convergence of the Fisher-EM 
algorithm has been proved in this work only for one of the DLM models and the convergence for 
the other models should be investigated. In Section[5l the Fisher-EM algorithm is compared to 
existing clustering methods on simulated and real datasets. Section [6] presents the application 
of the Fisher-EM algorithm to a real- world clustering problem in mass-spectrometry imaging. 
Some concluding remarks and ideas for further works are finally given in Section [71 

2 Related works 

Clustering is a traditional statistical problem which aims to divide a set of observations 
{Ui, . . . , Un} described by p variables into K homogeneous groups. The problem of clustering 
has been widely studied for years and the reader could refer to |2H |30] for reviews on the 
clustering problem. However, the interest in clustering is still increasing since more and 
more scientific fields require to cluster high-dimensional data. Moreover, such a task remains 
very difficult since clustering methods suffer from the well-known curse of dimensionality [6]. 
Conversely, the empty space phenomenon |48| . which refers to the fact that high-dimensional 
data do not fit the whole observation space but live in low-dimensional subspaces, gives hope 
to efficiently classify high-dimensional data. This section firstly reviews the framework of 
model-based clustering before exposing the existing approaches to deal with the problem of 
high dimension in clustering. 

2.1 Model-based clustering and high-dimensional data 

Model-based clustering, which has been widely studied by |211I40| in particular, aims to parti- 
tion observed data into several groups which are modeled separately. The overall population 
is considered as a mixture of these groups and most of time they are modeled by a Gaus- 
sian structure. By considering a dataset of n observations {yi, . . . ,yn} which is divided into 
K homogeneous groups and by assuming that the observations {yi, ■■■,yn} are independent 
realizations of a random vector Y gMP , the mixture model density is then: 



where /(.; 0^) is often the multivariate Gaussian density (/)(.; /i^., S^) parametrized by a mean 
vector fik and a covariance matrix for the A:th component. Unfortunately, model-based 
clustering methods show a disappointing behavior when the number of observations is small 
compared to the number of parameters to estimate. Indeed, in the case of the full Gaussian 
mixture model, the number of parameters to estimate is a function of the square of the 
dimension p and the estimation of this potentially large number of parameters is consequently 
difficult with a small dataset. In particular, when the number of observations n is of the same 
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order than the number of dimensions p, most of the model-based clustering methods have to 
face numerical problems due to the ill-conditioning of the covariance matrices. Furthermore, 
it is not possible to use the full Gaussian mixture model without restrictive assumptions for 
clustering a dataset for which n is smaller than p. Indeed, for clustering such data, it would 
be necessary to invert K covariance matrices which would be singular. To overcome these 
problems, several strategies have been proposed in the literature among which dimension 
reduction and subspace clustering. 

2.2 Dimension reduction and clustering 

Earliest approaches proposed to overcome the problem of high dimension in clustering by 
reducing the dimension before using a traditional clustering method. Among the unsupervised 
tools of dimension reduction, PCA [32] is the traditional and certainly the most used technique 
for dimension reduction. It aims to project the data on a lower dimensional subspace in 
which axes are built by maximizing the variance of the projected data. Non-linear projection 
methods can also be used. We refer to |51) for a review on these alternative dimension 
reduction techniques. In a similar spirit, the generative topographic mapping (GTM) |9] 
finds a non linear transformation of the data to map them on low-dimensional grid. An 
other way to reduce the dimension is to select relevant variables among the original variables. 
This problem has been recently considered in the clustering context by [10] and |;36]. In |45| 
and [38], the problem of feature selection for model-based clustering is recasted as a model 
selection problem. However, such approaches remove variables and consequently information 
which could have been discriminative for the clustering task. 

2.3 Subspace clustering 

In the past few years, new approaches focused on the modeling of each group in specific 
subspaces of low dimensionality. Subspace clustering methods can be split into two cate- 
gories: heuristic and probabilistic methods. Heuristic methods use algorithms to search for 
subspaces of high density within the original space. On the one hand, bottom-up algorithms 
use histograms for selecting the variables which best discriminate the groups. The Clique 
algorithm [1] was one of the first bottom-up algorithms and remains a reference in this family 
of methods. On the other hand, top-down algorithms use iterative techniques which start 
with all original variables and remove at each iteration the dimensions without groups. A 
review on heuristic methods is available in [44]. Conversely, probabilistic methods assume 
that the data of each group live in a low-dimensional latent space and usually model the 
data with a generative model. Earlier strategies |46) are based on the factor analysis model 
which assumes that the latent space is related with the observation space through a linear 
relationship. This model was recently extended in [Sj HI] and yields in particular the well 
known mixture of probabilistic principal component analyzers j45| . Recent works [111 H2] 
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proposed two families of parsimonious and regularized Gaussian models which partially en- 
compass previous approaches. All these techniques turned out to be very efficient in practice 
to cluster high-dimensional data. However, despite their qualities, these probabilistic methods 
mainly consider the clustering aim and do not take enough into account the visualization and 
understanding aspects. 

2.4 Prom Fisher's theory to discriminative clustering 

In the case of supervised classification. Fisher poses, in his precursor work |18| . the problem 
of the discrimination of three species of iris described by four measurements. The main goal 
of Fisher was to find a linear subspace that separates the classes according to a criterion 
(see for more details). For this, Fisher assumes that the dimensionality p of the original 
space is greater than the number K of classes. Fisher discriminant analysis looks for a linear 
transformation U which projects the observations in a discriminative and low dimensional 
subspace of dimension d such that the linear transformation U of dimension p x d aims to 
maximize a criterion which is large when the between-class covariance matrix (Sb) is large 
and when the within-covariance matrix (Sw) is small. Since the rank of Sb is at most equal 
to K — 1, the dimension d of the discriminative subspace is therefore at most equal to K — 1 
as well. Four different criteria can be found in the literature which satisfy such a constraint 
(see [23j for a review). The criterion which is traditionally used is: 

J{U) = trace{{U^SwU)-^U^SBU), (2.2) 

where Sw = ^ Y.k=i T^iacSy'- ~ ^k){yi - and Sb = ^ Ef=i nk{mk - y){mk - yf are 
respectively the within and the between covariance matrices, = ^ft^c^. yi is the em- 
pirical mean of the observed column vector yi in the class k and y = ^ XIa—i ^k'^^k is the 
mean column vector of the observations. The maximization of criterion (j2.2p is equivalent 
to the generalized eigenvalue problem |34) (^S^Sb — A/p) U = and the classical solution of 
this problem is the eigenvectors associated to the d largest eigenvalues of the matrix S^Sb- 
From a practical point of view, this optimization problem can also be solved using generalized 
eigenvalue solvers |24j in order to avoid numerical problems when Sw is ill-conditioned. Once 
the discriminative axes determined, linear discriminant analysis (LDA) is usually applied to 
classify the data into this subspace. The optimization of the Fisher criterion supposes the 
non-singularity of the matrix Sw but it appears that the singularity of Sw occurs frequently, 
particularly in the case of very high-dimensional space or in the case of under-sampled prob- 
lems. In the literature, different solutions [22 [ [25 1 1 ^ l^^ l l^^^l s-^e proposed to deal with such a 
problem in a supervised classification framework. In addition, since clustering approaches are 
sensitive to high-dimensional and noisy data, recent works |16| [35| \T5[ [53] focused on combin- 
ing low dimensional discriminative subspace with one of the most used clustering algorithm: 
k-means. However, these approaches do not really compute the discriminant subspace and 
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are not interested in the visualization and the understanding of the data. 

3 Model-based clustering in a discriminative subspace 

This section introduces a mixture model, called the discriminative latent mixture model, which 
aims to find both a parsimonious and discriminative fit for the data in order to generate a 
clustering and a visualization of the data. The modeling proposed in this section is mainly 
based on two key ideas: firstly, actual data are assumed to live in a latent subspace with an 
intrinsic dimension lower than the dimension of the observed data and, secondly, a subspace 
oi K — 1 dimensions is theoretically sufficient to discriminate K groups. 

3.1 The discriminative latent mixture model 

Let {yi, . . . ,yn} € denote a dataset of n observations that one wants to cluster into K 
homogeneous groups, i.e. adjoin to each observation yj a value zj G {1, . . . ,K} where Zi = k 
indicates that the observation y.j belongs to the A;*^ group. On the one hand, let us assume 
that {yi, . . . ,yn} arc independent observed realizations of a random vector y € and that 
{zi, . . . , Zn} are also independent realizations of a random vector Z G {!,..., K}. On the 
other hand, let E C M*" denote a latent space assumed to be the most discriminative subspace 
of dimension d < K — 1 such that G E and where d is strictly lower than the dimension p of 
the observed space. Moreover, let {xi, . . . ,Xn} G E denote the actual data, described in the 
latent space E of dimension d, which are in addition presumed to be independent unobserved 
realizations of a random vector X G E. Finally, for each group, the observed variable y G 
and the latent variable X G E are assumed to be linked through a linear transformation: 

Y = UX + s, (3.1) 

where d < p, U is the p x d orthogonal matrix common to the K groups, such as U^U = 1^,, 
and £ G W, conditionally to Z, is a centered Gaussian noise term with covariance matrix '^k, 
for k = 1,...,K: 

£|z=fc~Ar(0,^'fe). (3.2) 

Following the classical framework of model-based clustering, each group is in addition assumed 
to be distributed according to a Gaussian density function within the latent space E. Hence, 
the random vector X G E has the following conditional density function: 

X\Z = k^M{fik,^k), (3.3) 

where //^ G and G W^^*^ are respectively the mean and the covariance matrix of the fcth 
group. Conditionally to X and Z, the random vector y G M'' has the following conditional 
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distribution: 

Y\X,Z = k^N{UX,-^k), 
and its marginal distribution is therefore a mixture of Gaussians: 

K 



(3.4) 



(3.5) 



f{y) = ^ 7rfc(/)(y; rrik, Sk), 
k=i 

where tt^ is the mixture proportion of the kth group and: 



rrik = Ufik, 

are respectively the mean and the covariance matrix of the A;th group in the observation space. 
Let us also define W = [U,V] apxp matrix which satisfies W^W = WW^ = Ip and for which 
the p X [p — d) matrix V , is the ortlionormal complement of U defined above. We finally 
assume that the noise covariance matrix satisfies the conditions l/^'^y* = (ikld-p and 
U^kU* = Od, such that Afc = W^SkW has the foUowing form: 



A. 



V 











Pk 




d<K-l 



(p-d) 



This model, called the discriminative latent mixture (DLM) model and referred to by DLMpj./?^.] 
in the sequel, is summarized by Figure [TJ The DLMp^^^^j model is therefore parametrized 
by the parameters vr^, /i^, U, and /3k, for k = 1, ...,K and j = 1, d. On the one hand, 
the mixture proportions tti,...,ttk and the means fj,i,...,fiK parametrize in a classical way 
the prior probability and the average latent position of each group respectively. On the other 
hand, U defines the latent subspace E by parametrizing its orientation according to the basis 
of the original space. Finally, parametrize the variance of the kth group within the latent 
subspace E whereas (3k parametrizes the variance of this group outside E. With these nota- 
tions and from a practical point of view, one can say that the variance of the actual data is 
therefore modeled by Y^k and the variance of the noise is modeled by (3k- 
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Figure 1: Graphical summary of the DLMpj./?^.] model 
3.2 The submodels of the DLMpj^/?^.] model 

Starting with the DLMp^^^] model presented in the previous paragraph, several submodels 
can be generated by applying constraints on parameters of the matrix A^. For instance, 
the covariance matrices Si, ... , Tik in the latent space can be assumed to be common across 
groups and this submodel will be referred to by DLMp/s^.]- Similarly, in each group, can 
be assumed to be diagonal, i.e. S^. = diag(afci, . . . , a^^). This submodel will be referred 
to by DLM[^^_^.^^_]. In the same manner, the p — d last values of can be assumed to be 
common for the k classes, i.e. (3^ = /3, Wk = 1,...,K, meaning that the variance outside the 
discriminant subspace is common to all groups. This assumption can be viewed as modeling 
the noise variance with a unique parameter which seems natural for data obtained in a common 
acquisition process. Following the notation system introduces above, this submodel will be 
referred to by DLMj^^,^,^]. The variance within the latent subspace E can also be assumed 
to be isotropic for each group and the associated submodel is DLM[q,^^^]. In this case, the 
variance of the data is assumed to be isotropic both within E and outside E. Similarly, 
it is possible to constrain the previous model to have the parameters /3fc common between 
classes and this gives rise to the model DLMj^^,^]. Finally, the variance within the subspace E 
can be assumed to be independent from the mixture component and this corresponds to the 
models DLM[q,^,^^], DLMjq,^^], DLMjq,^^] and DLM[q,^]. We therefore enumerate 12 different 
DLM models and an overview of them is proposed in Table [TJ The table also gives the 
maximum number of free parameters to estimate (case d = K — 1) according to K and p 
for the 12 DLM models and for some classical models. The Full-GMM model refers to the 
classical Gaussian mixture model with full covariance matrices, the Com-GMM model refers 
to the Gaussian mixture model for which the covariance matrices are assumed to be equal to 
a common covariance matrix [S^ = S, V/c), Diag-GMM refers to the Gaussian mixture model 
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Model Nb. of parameters K = 4 and p = 100 



DLMrv, «,1 


{K-l) + K{K -1) + {K- l){p - K/2) + K^{K -l)/2 + K 


337 




(K — 1) + K(K — 1) + (K — l)(v - K/2) + K'^(K -1)72 + 1 


334 


DLMrsfl.l 


{K-\) + K{K -V) + (K - l){p - K/2) + KiK -l)/2 + K 


319 


DLMrv^l 


(K-l) + KiK -l) + iK - l)ip - KI2) + K{K - l)/2 + 1 


316 


L"fcjPfcJ 


- 1) + KiK -\) + iK- l)(p - KI2) + ^2 


325 




- 1) + KiK - 1) + - l)(p - KI2) + KiK - 1) + 1 


322 


DLMr„ 1 


(K -\\ + K(K -1) + (K - l)(v - K/2) + 2K 


317 




{K-l) + K{K - 1) + {K - l){p - K/2) + K + 1 


314 




{K-l) + K{K -1) + {K- l){p - K/2) + {K-l) + K 


316 




- 1) + K{K -1) + {K- l){p - K/2) + (i^: - 1) + 1 


313 




{K-l) + K{K -1) + {K- l){p - K/2) + K + 1 


314 


DLM[„^] 


{K -1) + K{K -1) + {K - l){p - K/2) + 2 


311 


Full-GMM 


{K -l) + Kp + Kp{p + l)/2 


20603 


Com-GMM 


{K -I) + Kp + p{p+ l)/2 


5453 


Mixt-PPCA 


{K-1) + Kp + K{d{p -{d+ l)/2) + d + 1) + 1 


1198 {d = 3) 


Diag-GMM 


{K -1) + Kp + Kp 


803 


Sphe-GMM 


{K-l)+Kp + K 


407 


Table 1: Number of free parameters to estimate when d = K — 1 for the DLM models and 
some classical models (see text for details). 
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for which Sk = diag(s^-^, s^^) with s| G and Sphe-GMM refers to the Gaussian mixture 
model for which = sf.Ip with € M. Finally, Mixt-PPCA denotes the subspace clustering 
model proposed by Tipping and Bishop in |49) . In addition to the number of free parameters 
to estimate, Table [1] gives this number for specific values of K and p in the right column. The 
number of free parameters to estimate given in the central column can be decomposed in the 
number of parameters to estimate for the proportions {K — 1), for the means (Kp) and for 
the covariance matrices (last terms). Among the classical models, the Full-GMM model is a 
highly parametrized model and requires the estimation of 20603 parameters when K = 4 and 
p = 100. Conversely, the Diag-GMM and Sphe-GMM model are very parsimonious models 
since they respectively require the estimation of only 803 and 407 parameters when K = 4 
and p = 100. The Com-GMM and Mixt-PPCA models appear to both have an intermediate 
complexity. However, the Mixt-PPCA model is a less constrained model compared to the 
Diag-GMM model and should be preferred for clustering high-dimensional data. Finally, the 
DLM models turn out to have a low complexity whereas their modeling capacity is comparable 
to the one of the Mixt-PPCA model. In addition, the complexity of the DLM models depends 
only on K and p whereas the Mixt-PPCA model depends from an hyper-parameter d. 

3.3 Links with existing models 

At this point, some links can be established with models existing in the clustering literature. 
The closest models have been proposed in [5], |11| and |42| and are all derived from the 
mixture of factor analyzer (MFA) model |41l 146) . First, in |11) . the authors proposed a 
family of 28 parsimonious and flexible Gaussian models ranging from a very general model, 
referred to as [akjbkQkdk], to very simple models. Compared to the standard MFA model, 
these parsimonious models assume that the noise variance is isotropic. In particular, this 
work can be viewed as an extension of the mixture of principal component analyzer (Mixt- 
PPCA) model |49| . Among this family of parsimonious models, 14 models assume that the 
orientation of the group-specific subspaces is common (common Qk)- The following year, 
McNicholas and Murphy |42| proposed as well a family of 8 parsimonious Gaussian models 
by extending the MFA model by constraining the loading and error variance matrices across 
groups. In this work, the noise variance can be isotropic or not. Let us remark that the 
two families of parsimonious Gaussian models share some models: for instance, the model 
UUC of [42j corresponds to the model [akjbkQkd] of [llj. Among the 8 parsimonious models 
presented in [32], 4 models have the loading matrices constrained across the groups. More 
recently. Beak et al. [5j proposed as well a MFA model with a common loading matrix. In this 
case, the noise variance is not constrained. Despite their differences, all these parsimonious 
Gaussian models share the assumption that the group subspaces have a common orientation 
and are therefore close to the DLM models presented in this work. However, these models 
with common loadings choose the orientation such as the variance of the projected data is 
maximum whereas the DLM models choose the latent subspace orientation such as it best 
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discriminates the groups. This specific feature of the DLM models should therefore improve 
in most cases both the clustering and the visualization of the results. In particular, the 
DLM models should be able to better model situations where the axes carrying the greatest 
variance are not parallel to the discriminative axes than the other approaches (Figure 10.1 
of |23| illustrates such a situation). 

4 Parameter estimation: the Fisher-EM algorithm 

Since this work focuses on the clustering of unlabeled data, this section introduces an esti- 
mation procedure which adapts the traditional EM algorithm for estimating the parameters 
of DLM models presented in the previous section. Due to the nature of the models described 
above, the Fisher-EM algorithm alternates between three steps: 

• an E step in which posterior probabilities that observations belong to the K groups are 
computed, 

• a F step which estimates the orientation matrix U of the discriminative latent space 
conditionally to the posterior probabilities, 

• a M step in which parameters of the mixture model are estimated in the latent subspace 
by maximizing the conditional expectation of the complete likelihood. 

This estimation procedure relative to the DLM models is called hereafter the Fisher-EM 
algorithm. We chose to name this estimation procedure after Sir R. A. Fisher since the key 
idea of the F step comes from his famous work on discrimination. The remainder of this section 
details the simple form of this procedure. Let us however notice that the Fisher-EM algorithm 
can be also used in combination with the stochastic [^5] and classification versions [14j of the 
EM algorithm. 

4.1 The E step 

This step aims to compute, at iteration (g), the expectation of the complete log- likelihood 
conditionally to the current value of the parameter 6^'i~^\ which, in practice, reduces to the 
computation of = E[zik\yi,9'^'^~^^] where Zik = 1 if ?/« comes from the kih component 
and = otherwise. Let us also recall that t^^^ is as well the posterior probability that 
the observation yi belongs to the k^^ component of the mixture. The following proposition 
provides the explicit form of t-^'', for i = l,...,n, k = l,...,i^, in the case of the model 
DLMp;,/3j,]- Demonstration of this result is detailed in Appendix lA.ll 

Proposition 1. With the assumptions of the mode/ DLMp^.^^], the posterior probabilities t^^ , 
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i = 1, ...,n, k = 1, ■■■,K , can be expressed as 



" E£.exp(i(rr»fa)-rr)fe)r 



with: 



rt'\y^^ = \\P{yr - mt'^)\\l^ + ^ll(y. - mt'^) - P{V^ - mt'^)\? 

f^k (4.2) 

+ log +{p- d)log{^t'^) - 21og(4''-^): 



■7, 

where ||-||x)j. ^-^ norm on the latent space E defined by \\y\\x)^ = y^T>ky, T^k = WA'^^W^, 
W is a p X p matrix containing the d vectors of [/(^-i) completed by zeros such as W = 
[U^'^^^\Op^d], P is the projection operator on the latent space E, i.e. -P(y) = U^'^^^^U^'^^^^^y, 
and 7 = plog(27r) is a constant term. 

Besides its computational interest, Proposition [T] provides as well a comprehensive inter- 
pretation of the cost function which mainly governs the computation of tik- Indeed, it 
appears that mainly depends on two distances: the distance between the projections on 
the discriminant subspace E of the observation yi and the mean mk on the one hand, and, 
the distance between the projections on the complementary subspace E"*- of yi and on the 
other hand. Remark that the latter distance can be reformulated in order to avoid the use 
of the projection on E"*-. Indeed, as Figure [2] illustrates, this distance can be re-expressed ac- 
cording projections on E. Therefore, the posterior probability tik = P{zik = will be close 
to 1 if both the distances are small which seems quite natural. Obviously, these distances are 
also balanced by the variances in E and E"*- and by the mixture proportions. Furthermore, 
the fact that the E step does not require the use of the projection on the complementary 
subspace E"*- is, from a computational point of view, very important because it will provide 
the stability of the algorithm and will allow its use when n <C p (c/. paragraph 14. 6p . 

4.2 The F step 

This step aims to determinate, at iteration (q), the discriminative latent subspace of dimension 
d < K — 1 ill which the K groups are best separated. Naturally, the estimation of this latent 
subspace has to be done conditionally to the current values of posterior probabilities t^l^ 
which indicates the current soft partition of the data. Estimating the discriminative latent 
subspace E^"^) reduces to the computation of d discriminative axes. Following the original idea 
of Fisher |18| . the d axes which best discriminate the K groups are those which maximize the 
traditional criterion J{U) = tr{{U'' Sy/U)~^U'' S bU) . However, the traditional criterion J{U) 
assume that the data are complete (supervised classification framework). Unfortunately, the 
situation of interest here is that of unsupervised classification and the matrices Sb and Sw 
have therefore to be defined conditionally to the current soft partition. Furthermore, the 
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DLM models assume that the discriminative latent subspace must have an orthonormal basis 
and, sadly, the traditional Fisher's approach provides non-orthogonal discriminative axes. 

To overcome both problems, this paragraph proposes a procedure which keeps the key 
idea of Fisher while providing orthonormal discriminative axes conditionally to the current 
soft partition of the data. The procedure follows the concept of the orthonormal discriminant 
vector (ODV) method introduced by [19j in the supervised case and then extended by |25| f2U[ 
[371 [52] , which sequentially selects the most discriminative features in maximizing the Fisher 
criterion subject to the orthogonality of features. First, it is necessary to introduce the soft 
between-covariance matrix S^'' and the soft within-covariance matrix S^} . The soft between- 
covariance matrix S^^^ is defined conditionally to the posterior probabilities t^'^ , obtained in 
the E step, as follows: 

^S?^ = ^E-?^(-f (4-3) 

k=l 

where n^"^^ = '}27=i^ik ^ '^i''^ ~ n Si=i ^ifc''^* mean of the kth group at iteration 

q and y = ;^ J/i is the empirical mean of the whole dataset. Since the relation S = 

+ 5*^^ holds in this context as well, it is preferable from a computational point of view to 
use the covariance matrix S = ^ Yl^=iiyi~y)iyi~yY whole dataset in the maximization 

problem instead of S^) since S remains fixed over the iteration. The F step of the Fisher-EM 
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therefore aims to solve the following optimization problem: 

(4.4) 



max trace ( (U^SUy^U^S^^^U) , 



wrt ti>/ = 0, yj ^ I e {I,. . . ,d}, 

where Uj is the jth colmnn vector of U. Following the ODV procedure, the d axes solution 
of this optimization problem are iteratively constructed by, first, computing an orthogonal 
complementary subspace to the current set of discriminative axes and, then, maximizing the 
Fisher criterion in this orthogonal subspace by solving the associated generalized eigenvalue 
problem. To initialize this iterative procedure, the first vector of U is therefore the eigen- 
vector associated with the largest eigenvalue of the matrix S~^S^^\ Then, assuming that 
the r — 1 first orthonormal discriminative axes {ui, . . . ,Ur-i}, which span the space Br-i, 
have been computed, the r*'^ discriminative axis has to lie in the subspace B:^_i orthogonal 
to the space Br-i- The Gram-Schmidt ortlionormalization procedure allows to find a basis 
V^' = {vr,Vr+i, ■■■,Vd} for the orthogonal subspace B:^_i such that: 

£-1 

vi = ai{Ii_i -'^VjV*j)ipi, e = r,...,p (4.5) 

i=i 

where Vj = Uj for j = 1, ...,r — 1, ag is normalization constant such that \\ui\\ = 1 and ipi is 
a vector linearly independent of uj Vj G {1, — 1}. Then, the rth discriminative axis is 
given by: 

p ^„,max 

1 1 yinax \ ■ / 

where Pr-i is the projector on Br-i, u^""^ is the eigenvector associated with the largest 
eigenvalue of the matrix S~^S^^l with: 

i.e. Sr and 5"^^ are respectively the covariance and soft between-covariance matrices of the 
data projected into the orthogonal subspace B:^^^. This iterative procedure stops when the d 
orthonormal discriminative axes uj are computed. 

4.3 The M step 

This third step estimates the model parameters by maximizing the conditional expectation of 
the complete likelihood. The following proposition provides the expression of the conditional 
expectation of the complete log-likelihood in the case of the DLMpj.^^,] model. A proof of 
this result is provided in Appendix lA. 21 
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Proposition 2. In the case of the mode/ DLMp^.^^.], the conditional expectation of complete 
log-likelihood Q{yi, 0) has the following expression: 

1 ^ 

Qiyi, ...,yn,e) = --^nk[-2 log(7rfc) + tvaceiJ:^^ U'CkU) + log(|Sfc|) 
k=i 

+ {p-d) log(/3fc) + — j trace(Cfc) - ^ u]CkuA + 7 . 



where Ck is the empirical covariance matrix of the k^^ group, uj is the jth column vector ofU, 
Hk = Y17=i ^ik ^''^^ 'y ~ ^-^ ^ constant term. 

At iteration q, the maximization of Q conduces to an estimation of the mixture proportions 
TTfc and the means Hk for the K components by their empirical counterparts: 

-(g) _ rik 



" n 



n 



i=i 

where = ^^=1 and contains, as columns vectors, the d discriminative axes n^- , 
j = 1, d, provided by the F step at iteration q. The following proposition provides estimates 
for the remaining parameters for the 12 DLM models which have to be updated at each 
iteration of the FEM procedure. Proofs of the following results are given in Appendix IA.2I 

Proposition 3. At iteration q, the estimates for variance parameters of the 12 DLM models 



are: 



Model DLMp,/3,].- 

^(9) ^ jj{q)t(j{i)jj{q) oil) ^ trace(Ofc ) - l^j^^ Uj 
k k ■> h'k p - d ' 

Model DLMp^^j; 

^(^) = f/(9)*c(%(^) trace(C(^))-E?..^fg(^)nf ^ 

k fc ' p — d 

Model DLMp^^] ; 



(4.9) 



trarpl'r^''^ - ,,(9)i^(g) (?) 

' ^ p — d ' 
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Model DLMp/3] 



' p — d 



Model DLM[„^.^]; 



Model DLM[„^^^]: 



Mociei DLM 



Model DLM[„.^^]; 



J J J K p — d 



Model DLM 



J J J p — d 



Model DLM[o,^^] : 



^k 
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Model DLMr 



= iy J.5)*C('?)n''\ /3(«) = ^ — ±±2^L^ (4.19) 

d ^-^ J J p — d 



where the vectors u - are the discriminative axes provided by the F step at iteration q, = 
~Tq)^l^=i^'ik {Vi ~ ''^i!^)iyi ~ ^^k^Y is soft covariance matrix of the kth group, m^^^ = 

n Si=i^!fc^2/j finally C = ^"^^^inkCk is the soft within- covariance matrix of the K 
groups. 



4.4 Initialization and model selection 

Since the Fisher-EM procedure presented in this work belongs to the family of EM-based 
algorithms, the Fisher-EM algorithm can inherit the most efficient strategies for initialization 
and model selection from previous works on the EM algorithm. 

Initialization Although the EM algorithm is widely used, it is also well-known that the 
performance of the algorithm is linked to its initial conditions. Several strategies have been 
proposed in the literature for initializing the EM algorithm. A popular practice [8] executes the 
EM algorithm several times from a random initialization and keep only the set of parameters 
associated with the highest likelihood. The use of k-means or of a random partition are 
also standard approaches for initializing the algorithm. McLachlan and Peel |40j have also 
proposed an initialization through the parameters by generating the mean and the covariance 
matrix of each mixture component from a multivariate normal distribution parametrized by 
the empirical mean and empirical covariance matrix of the data. In practice, this latter 
initialization procedure works well but, unfortunately, it cannot be applied directly to the 
Fisher-EM algorithm since model parameters live in a space different from the observation 
space. A simple way to adapt this strategy could be to first determine a latent space using 
PCA and then simulate mixture parameters in this initialization latent space. 

Model selection In model-based clustering, it is frequent to consider several models in 
order to find the most appropriate model for the considered data. Since a model is defined by 
its number of component K and its parametrization, model selection allows to both select a 
parametrization and a number of components. Several criteria for model selection have been 
proposed in the literature and the famous ones are penalized likelihood criteria. Classical 
tools for model selection include the AIC [2], BIC |47) and ICL criteria. The Bayesian 
Information Criterion (BIC) is certainly the most popular and consists in selecting the model 
which penalizes the likelihood by "^^^^ log(n) where ^{M) is the number of parameters in 
model M and n is the number of observations. On the other hand, the AIC criterion penalizes 



17 



the log-likelihood by 7(7M) whereas the ICL criterion add the penalty X^"^]^ ^ik ^osi^ik) 

to the one of the BIG criterion in order to favor well separated models. The value of J^M) is 
of course specific to the model selected by the practitioner ( cf. Table [1]) . In the experiments of 
the following sections, the BIG criterion is used because of its popularity but the IGL criterion 
should also be well adapted in our context. 

4.5 Computational aspects 

As all iterative procedures, the convergence, the stopping criterion and the computational 
cost of the Fisher-EM algorithm deserve to be discussed. 

Convergence Although the Fisher-EM algorithm presented in the previous paragraphs is 
an EM-like algorithm, it does not satisfy at a first glance to all conditions required by the 
convergence theory of the EM algorithm. Indeed, the update of the orientation matrix U in 
the F step is done by maximizing the Fisher criterion and not by directly maximizing the 
expected complete log-likelihood as required in the EM algorithm theory. From this point of 
view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed. However, 
as demonstrated by Gampbell |12| in the supervised case and by Geleux and Govaert |14| in the 
unsupervised case, the maximization of the Fisher criterion is equivalent to the maximization 
of the complete likelihood when all mixture components have the same diagonal covariance 
matrix (Sk = cr^ip for k = 1, ...,K). In our model, by considering the homoscedastic case with 
a diagonal covariance matrix, the conditional expectation of the complete log-likelihood can 
be rewritten as — ^ trace (^{U^SU^ ^ (U^WU^^ + 7 where 7 is a constant term according 
to U. Hence, with these assumptions, maximizing this criterion according to U is equivalent 
to minimizing the Fisher criterion trace (^{U^SU) ^ {U^WU)y Gonsequently, for the model 
DLMjq^] which assumes the equality and the diagonality of covariance matrices, the F step 
of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory 
and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other 
DLM models, although the convergence of the Fisher-EM procedure cannot be guaranteed, 
our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with 
these models if correctly initialized. 

Stopping criterion and convergence monitoring To decide whether the algorithm has 
converged or not, we propose to use the Aitken's criterion |39| . This criterion estimates the 
asymptotic maximum of the log-likelihood in order to detect in advance the algorithm conver- 
gence. Indeed, the convergence of the EM algorithm can be sometimes slow in practice due 
to its linear convergence rate and it is often not necessary to wait for the actual convergence 
to obtain a good parameter estimate under standard conditions. At iteration q, the Aitken's 
criterion is defined by A(i'> = - / - where is the log-likelihood 
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value at iteration q. Then, asymptotic estimate of the log-Ukehhood maximum is given by: 



small positive number (provided by the user). In practice, if the criterion is not satisfied 
after a maximum number of iterations (provided by the user as well), the algorithm stops. 
Afterward, it is possible to check whether the provided estimate is a local maximum by 
computing the Hessian matrix (using finite differentiation) which should be negative definite. 
In the experiments presented in the following section, the convergence of the Fisher-EM 
algorithm has been checked using such an approach. 

Computational cost Obviously, since the additional F step is iterative, the computational 
complexity of the Fisher-EM procedure is somewhat bigger than the one of the ordinary EM 
algorithm. The F step requires d{d — 2)/2 iterations due to the Gram-Schmidt procedure used 
for the orthogonalization of U. However, since d is at most equal to K — 1 and is supposed to 
be small compared to p, the complexity of the F step is not a quadratic function of the data 
dimension which could be large. Furthermore, it is important to notice that the complexity of 
this step does not depend on the number of observations n. Although the proposed algorithm 
is more time consuming than the usual EM algorithm, it is altogether actually usable on recent 
PCs even for large scale problems. Indeed, we have observed on simulations that Fisher-EM 
appears to be 1.5 times slower on average than EM (with a diagonal model). As an example, 
24 seconds are on average necessary for Fisher-EM to cluster a dataset of 1 000 observations 
in a 100-dimensional space whereas EM requires 16 seconds. 

4.6 Practical aspects 

The DLM models, for which the Fisher-EM algorithm has been proposed as an estimation 
procedure, presents several practical and numerical interests among which the ability to vi- 
sualize the clustered data, to interpret the discriminative axes and to deal with the so-called 
n <^ p problem. 

Choice of d and visualization in the discriminative subspace The proposed DLM 
models are all parametrized by the intrisinc dimension d of the discriminative latent subspace 
which is theoretically at most equal to K — 1. Even though the actual value of d is strictly 
smaller than K — 1 for the dataset at hand, we recommand in practice to set d = K — 1 
when numerically possible in order to avoid stability problems with the Fisher-EM algorithm. 
Furthermore, it is always better to extract more discriminative axes than to miss relevant 
dimensions and K — 1 is often in practice a small value compared to p. Besides, a natural use 
of the discriminative axes may certainly be the visualization of the clustered data. Indeed, it 




(4.20) 



and the algorithm can be considered to have converged if ^6o is smaller than a 
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is nowadays clear that the visuahzation help human operators to understand the results of an 
analysis. With the Fisher-EM algorithm, it is easy to project and visualize the cluster data 
into the estimated discriminative latent subspace ii K < 4. When K > 4, the actual value of 
d can be estimated by looking at the eigenvalue scree of S^Sb and two cases have therefore 
to be considered. On the one hand, if the estimated value of d is at most equal to 3, the 
practitioner can therefore visualize his data by projecting them on the d first discriminative 
axes and no discriminative information loss is to be deplored in this case. On the other hand, 
if the estimated value of d is strictly larger than 3, the visualization becomes obviously more 
difficult but the practitioner may simply use the 3 first discriminative axes which are the most 
discriminative ones among the K — 1 provided axes. Let us finally notice that the visualization 
quality is of course related to the clustering quality. Indeed, the visualization provided by 
the Fisher-EM algorithm may be disappointing if the clustering results are poor, due to a 
bad initialization for instance. A good solution to avoid such a situation may be to initialize 
the Fisher-EM algorithm with the "mini-EM" strategy or with the results of a classical EM 
algorithm. 

Interpretation of the discriminative axes Beyond the natural interest of visualization, 
it may also be useful from a practical point of view to interpret the estimated discriminative 
axes, i.e. ui,...,Ud with the notations of the previous sections. The main interest for the 
practitioner would be to figure out which original dimensions are the most discriminative. 
This can be done by looking at the matrix U which contains ui,...,Uii as column vectors. 
In the classical framework of factor analysis, this matrix is known as the loading matrix 
(the discriminative axes ui,...,Ud are the loadings). Thus, it is possible to find the most 
discriminative original variables by selecting the highest values in the loadings. A simple way 
to highlight the relevant variables is to threshold the loadings (setting to zero the values less 
than a given threshold). Let us finally remark that finding the most discriminative original 
variables is of particular interest in application fields, such as biology or economics, where the 
observed variables have an actual meaning. 

Dealing with the n <^ p problem Another important and frequent problem when cluster- 
ing high-dimensional data is known as high dimension and low sample size (HDSS) problem 
or the n <^ p problem (we refer to |28| Chap. 18] for an overview). The n <^ p problem 
refers to situations where the number of features p is larger than the number of available 
observations n. This problem occurs frequently in modern scientific applications such as ge- 
nomics or mass spectrometry. In such cases, the estimation of model parameters for generative 
clustering methods is either difficult or impossible. This task is indeed very difficult when 
n <^ p since generative methods require, in particular, to invert covariance matrices which are 
ill-conditioned in the best case or singular in the worst one. In contrast with other generative 
methods, the Fisher-EM procedure can overcome the n <^ p problem. Indeed, the E and 
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M steps of Fisher-EM do not require the determination of the last p — d columns of W (see 
equations (j4.2p and (j4.18p - (j4.19p ) and, consequently, it is possible to modify the F step to 
deal with situations where n <^ p. To do so, let Y denote the centered data matrix and T 
denote, as before, the soft partition matrix. We define in addition the weighted soft partition 
matrix T where the jth column Tj of T is the jth column Tj of T divided by nj = Y17=i ^ij- 
With these notations, the between covariance matrix B can be written in its matrix form 
B = Y^T^TY and the F step aims to maximize, under orthogonality constraints, the func- 
tion /([/) = trace (^{U^Y^YU)-^U^Y''f^fYU^ . It follows from the classical result of kernel 
theory, the Representer theorem |33j, that this maximization can be done in a different space 
and that U can be expressed as U = YH where H € M^^^. Therefore, the F step reduces to 
maximize, under orthogonality constraints, the following function: 



where G = YY^ is the n x n Gram matrix. The solution U* of the original problem can be 
obtained afterward from the solution H* of (|4.2ip by multiplying it by Y. Thus, the F step 
reduces to the eigendecomposition under orthogonality constraints of a n x n matrix instead of 
apxp matrix. This procedure is useful for the Fisher-EM procedure only because it allows to 
determine d < n axes which are enough for Fisher-EM but not for other generative methods 
which require the computation of the p axes. 

5 Experimental results 

This section presents experiments on simulated and real datasets in order to highlight the 
main features of the clustering method introduced in the previous sections. 

5.1 An introductory example: the Fisher's irises 

Since we chose to name the clustering algorithm proposed in this work after Sir R. A. Fisher, 
the least we can do is to first apply the Fisher-EM algorithm to the iris dataset that Fisher 
used in |18J as an illustration for his discriminant analysis. This dataset, in fact collected by 
E. Anderson [3] in the Gaspe peninsula (Canada), is made of three groups corresponding to 
different species of iris {setosa, versicolor and virginica) among which the groups versicolor 
and virginica are difficult to discriminate (they are at least not linearly separable). The 
dataset consists of 50 samples from each of three species and four features were measured 
from each sample. The four measurements are the length and the width of the sepal and the 
petal. This dataset is used here as an introductory example because of the link with Fisher's 
work but also of its popularity in the clustering community. 

In this first experiment, Fisher-EM has been applied to the iris data (of course, the labels 
have been used only for performance evaluation) and the Fisher-EM results will be com- 
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Figure 3: Projection of clustered Iris data into the latent discriminative subspace with Fisher- 
EM (left) and evolution of the associated log-likelihood (right). 
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Setosa 50 
Versicolor 48 2 
Virginica 1 49 


Setosa 50 
Versicolor 47 3 
Virginica 50 


Misclassification rate = 0.02 


Misclassification rate = 0.02 



Table 2: Confusion tables for the iris data with OLDA method (supervised) and Fisher-EM 
(unsupervised) . 
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sepal length 
sepal width 
petal length 
petal width 


0.209 0.044 
0.386 0.665 
-0.554 -0.356 
-0.707 0.655 


-0.203 -0.108 
-0.422 0.088 
0.602 0.736 
0.646 -0.662 



Table 3: Fisher axes estimated in the supervised case (OLDA) and in the unsupervised case 
(Fisher-EM). 
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pared to the ones obtained in the supervised case with the orthogonal hnear analysis method 
(OLDA) |52) . The left panel of Figure [3] stands for the projection of the irises in the esti- 
mated discriminative space with Fisher-EM and the right panel shows the evolution of the 
log-likelihood on 25 iterations until convergence. First of all, it can be observed that the 
estimated latent space discriminates almost perfectly the three different groups. For this ex- 
periment, the clustering accuracy has reached 98% with the DLMjq,^^] model of Fisher-EM. 
Secondly, the right panel shows the monotonicity of the evolution of the log-likelihood and the 
convergence of the algorithm to a stationary state. Table [2] presents the confusion matrices for 
the partitions obtained with supervised and unsupervised classification methods. OLDA has 
been used for the supervised case (reclassification of the learning data) whereas Fisher-EM 
has provided the clustering results. One can observe that the obtained partitions induced by 
both methods is almost the same. This confirms that Fisher-EM has correctly modeled both 
the discriminative subspace and the groups within the subspace. It is also interesting to look 
at the loadings provided by both methods. Table [3] stands for the linear coefficients of the 
discriminative axes estimated, on the one hand, in the supervised case (OLDA) and, on the 
other hand, in the unsupervised case (Fisher-EM). The first axes of each approach appear 
to be very similar and the scalar product of these axes is —0.996. This highlights the per- 
formance of the Fisher-EM algorithm in estimating the discriminative subspace of the data. 
Furthermore, according to these results, the 3 groups of irises can be mainly discriminated by 
the petal size meaning that only one axis would be sufficient to discriminate the 3 iris species. 
Besides, this interpretation turns out to be in accordance with the recent work of Trendafilov 
and Joliffe [50] on variable selection in discriminant analysis via the LASSO. 

5.2 Simulation study: influence of the dimension 

This second experiment aims to compare with traditional methods the stability and the ef- 
ficiency of the Fisher-EM algorithm in partitioning high-dimensional data. Fisher-EM is 
compared here with the standard EM algorithm (Full-GMM) and its parsimonious models 
(Diag-GMM, Sphe-GMM and Com-GMM), the EM algorithm applied in the first compo- 
nents of PGA explaining 90% of the total variance (PGA-EM), the k- means algorithm and 
the mixture of probabilistic principal component analyzers (Mixt-PPGA). For this simulation, 
600 observations have been simulated following the DLM[^^^,^^] model proposed in Section [3l 
The simulated dataset is made of 3 unbalanced groups and each group is modeled by a Gaus- 
sian density in a 2-dimensional space completed by orthogonal dimensions of Gaussian noise. 
The transformation matrix W has been randomly simulated such as W^W = WW^ = Ip and, 
for this experience, the dimension of the observed space varies from 5 to 100. The left panel of 
Figure m shows the simulated data in their 2-dimensional latent space whereas the right panel 
presents the projection of 50-dimensional observed data on the two first axes of PGA in the 
observed space. As one can observe, the representation of the data on the two first principal 
components is actually not well suited for clustering these data while it exists a representation 
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Figure 4: Visualization of the simulated data: data in their latent space (left) and data 
projected on the first principal components (right). 

which discriminates perfectly the three groups. Moreover, to make the results of each method 
comparable, the same randomized initialization has been used for the 8 algorithms. The 
experimental process has been repeated 20 times for each dimension of the observed space 
in order to see both the average performances and their variances. Figure [5] presents the 
evolution of the clustering accuracy of each method (EM, PCA-EM, k-means, Mixt-PPCA, 
Fisher-EM, Diag-GMM, Sphe-GMM and Com-GMM) according to the data dimensionality 
and Figure [6] presents their respective boxplots. First of all, it can be observed that the Full- 
GMM, PCA-EM and Com-GMM have their performances which decrease quickly when the 
dimension increases. In fact, the Full-GMM model does not work upon the 15th dimension and 
still remains unstable in a low dimensional space as well as the Com-GMM model. Similarly, 
the performances of PCA-EM fall down as the 10th dimension. This can be explained by the 
fact that the latent subspace provided by PCA does not allow to well discriminate the groups, 
as already suggested by Figured However, the PCA-EM approach can be used whatever the 
dimension is whereas Full-GMM cannot be used as the 20th dimension because of numerical 
problems linked to singularity of the covariance matrices. Moreover, their boxplots show a 
large variation on the clustering accuracy. Secondly, Sphe-GMM, Diag-GMM and k-means 
present the same trend with high performances in low-dimensional spaces which decrease un- 
til they reach a clustering accuracy of 0.75. Diag-GMM seems however to resist a little bit 
more than k-means to the dimension increasing. Mixt-PPCA and Mclust both follow the 
same tendency as the previous methods but from the 30th dimension their performances fall 
down until the clustering accuracy reaches 0.5. The poor performances of Mixt-PPCA can be 
explained by the fact that Mixt-PPCA models each group in a different subspace whereas the 
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Figure 5: Influence of the dimension of the observed space on the correct classification rate 
for FuU-GMM, PCA-EM, Com-GMM, Mixt-PPCA, k-means, Diag-GMM, Sphe-GMM and 
Fisher-EM algorithms. 

model used for simulating the observations assumes a common discriminative subspace. Fi- 
nally, Fisher-EM appears to be more effective than the other methods and, more importantly, 
it remains very stable while the data dimensionality increases. Furthermore, the boxplot as- 
sociated with the Fisher-EM results suggests that it is a steady algorithm which succeeds in 
finding out the discriminative latent subspace of the data even with random initializations. 

5.3 Simulation study: model selection 

This last experiment on simulations aims to study the performance of BIC for both model and 
component number selection. For this experiment, 4 Gaussian components of 75 observations 
each have been simulated according to the DLMjq,^^] model in a 3-dimensional space completed 
by 47 orthogonal dimensions of Gaussian noise (the dimension of the observation space is 
therefore p = 50). The transformation matrix W has been again randomly simulated such 
as W^W = WW* = Ip. Table U] presents the BIC values for the family of DLM models 
and, in a comparative purpose, the BIC values for 7 other methods already used in the last 
experiments: EM with the Full-GMM, Diag-GMM, Sphe-GMM and Com-GMM models, Mixt- 
PPCA, Mclust |20| (with model [EEE] which is the most appropriate model for these data) 
and PCA-EM. Moreover, BIC is computed for different partition numbers varying between 2 
and 6 clusters. First of all, one can observe that the BIC values linked to the models which 
are different from the DLM model are very low compared to the DLM models. This suggests 
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Figure 6: Boxplots of Full-GMM, PCA-EM, Com-GMM, Mixt-PPCA, k-means, Diag-GMM, 
Sphe-GMM and Fisher-EM algorithms. 
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that the models which best fit the data are the DLM models. Secondly, 8 of the 12 DLM 
models select the right number of components (K = 4). In particular, the DLM models which 
assume a common variance between each cluster outside the latent subspace (models DLM[^]) 
all select the 4 clusters. The other methods under-estimate the number of clusters. BIC has 
the largest value for the DLMj^^^] model with 4 components which is actually the model used 
for simulating the data. Finally, the right-hand side of Figure [7] presents the projection of 
the data on the discriminative subspace of 3 dimensions estimated by Fisher-EM with the 
DLM[„^^] model whereas the left-hand side figure represents the projection of the data on 
the 3 first principal components of PCA. As one can observe, in the PCA case, the axes 
separate only 2 groups, which is in accordance with the model selection pointed out by BIC 
for this method. Conversely, in the Fisher-EM case, the 3 discriminative axes separate well 
the 4 groups and such a representation could clearly help the practitioner in understanding 
the clustering results. 

5.4 Real data set benchmark 

This last experimental paragraph will focus on comparing on real-world datasets the efficiency 
of Fisher-EM with several linear and nonlinear existing methods, including the most recent 
ones. On the one hand, Fisher-EM will be compared to the 8 already used clustering meth- 
ods: EM with the Full-GMM, Diag-GMM, Sphe-GMM and Com-GMM models, Mixt-PPCA, 
Mclust (with its most adapted model for these data), PCA-EM and k- means. On the other 
hand, the new Fisher-EM challengers will be k-means computed on the two first components of 
PCA (PCA-k-means), an heteroscedastic factor mixture analyzer (HMFA) method [43] and 
three discriminative versions of k-means: LDA-k- means |16| . Dis-k-means and DisCluster 
(see |53| for more details). The comparison has been made on 7 different benchmark datasets 
coming mostly from the UCI machine learning repository: 

• The chironomus data contain 148 larvae which are split up into 3 species and described 
by 17 morphometric attributes. This dataset is described in detailed in |33] . 

• The wine dataset is composed by 178 observations which are split up into 3 classes and 
characterized by 13 variables. 

• The iris dataset which is made of 3 different groups and described by 4 variables. This 
dataset has been described in detail in Section 15.11 

• The zoo dataset includes 7 families of 101 animals characterized by 16 variables. 

• The glass data are composed by 214 observations belonging to 6 different groups and 
described by 7 variables. 

• The 4435 satellite images are split up into 6 classes and are described by 36 variables. 
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number of components 


methods 


2 


3 


4 


5 


6 




-114.6172 


-114.5996 


-115.4875 


-115.6439 


-116.7350 




-116.9006 


-117.4791 


-115.0215 


-116.0837 


-116.8912 




-116.9007 


-116.9568 


-118.5480 


-119.3458 


-120.0418 


DLMp/3] 


-120.9006 


-120.2496 


-119.8787 


-120.6301 


-120.6166 




-116.5750 


-114.9578 


-114.7986 


-115.6658 


-116.5750 




-121.8565 


-117.4968 


-115.1525 


-115.8571 


-117.7598 




-115.2290 


-115.0808 


-114.7934 


-115.6603 


-116.5027 




-121.8565 


-117.6217 


-114.1471 


-115.7909 


-116.6739 




-116.7295 


-118.4031 


-119.2610 


-120.7783 


-122.0415 


-123.3448 


-120.9052 


-120.4578 


-121.1248 


-121.9098 




-118.7295 


-118.3865 


-119.7309 


-121.5124 


-123.1506 




-123.3443 


-120.8989 


-120.4347 


-121.7451 


-123.2730 


Full-GMM 


-177.6835 


-252.8908 


-440.6805 


-3005.531 


-4367.653 


Com-GMM 


-150.0518 


-193.0624 


-231.4546 


-270.2741 


-309.7809 


Mixt-PPCA 


-151.1561 


-176.3615 


-201.5709 


-226.7789 


-251.9931 


Diag-GMM 


-189.8663 


-262.7929 


-419.360 


-407.2755 


-466.6955 


Sphe-GMM 


-190.9812 


-258.3534 


-302.8030 


-382.7666 


-433.3845 


PCA-EM 


-127.0857 


-173.7174 


-247.3894 


-364.9811 


-594.4000 


Mclust[£;7/| 


-229.3360 


-229.3024 


-230.0155 


-230.8431 


-231.5140 



Table 4: BIG values for model selection. 



Projection on the 3 first principal components Projection on the discriminative axes estimated by Fisher-EM 




Figure 7: Projection of the data in the 3 first principal components of PGA (left) and in the 
discriminant subspace estimated by Fisher-EM with the DLMjq,^^]. 
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• Finally, the last dataset is the USPS data where only the classes which are difficult to 
discriminate are considered. Consequently, this dataset consists of 1756 records (rows) 
and 256 attributes divided in 3 classes (numbers 3, 5 and 8). 

Table [5] presents the average clustering accuracies and the associated standard deviations 
obtained for the 12 DLM models and for the methods already used in the previous experiments. 
The results for the 19 first methods of the table have been obtained by averaging 20 trials with 
random initializations except for Mclust which has its own deterministic initialization and this 
explains the lack of standard deviation for Mclust. Similarly, Table [6] provides the clustering 
accuracies found in the literature for the recent methods on the same datasets. It is important 
to notice that the results of Table [S] have been obtained in slightly different benchmarking 
situations. Missing values in Table [5] are due to non-convergence of the algorithms whereas 
missing values in Table [H] are due to the unavailability of the information for the concerned 
method. First of all, one can remark that Fisher-EM outperforms the other methods for most 
of the UCI datasets such as wine, iris, zoo, glass, satimage and usps358 datasets. Finally, it is 
interesting from a practical point of view to notice that some DLM models work well in most 
situations. In particular, the DLM[^] models, in which the variance outside the discriminant 
subspace is common to all groups, provide very satisfying results for all the datasets considered 
here. 

6 Application to mass spectrometry 

In this last experimental section, the Fisher-EM procedure is applied to the problem of cancer 
detection using MALDI mass spectrometry. MALDI mass spectrometry is a non-invasive 
biochemical technique which is useful in searching for disease biomarkers, assessing tumor 
progression or evaluating the efficiency of drug treatment, to name just a few applications. 
In particular, a promising field of application is the early detection of the colorectal cancer, 
which is one of the principal causes of cancer-related mortality, and MALDI imaging could in 
few years avoid in some cases the colonoscopy method which is invasive and quite expensive. 

6.1 Data and experimental setup 

The MALDI2009 dataset has been provided by Theodore Alexandrov from the Center for 
Industrial Mathematics (University of Bremen, Germany) and is made of 112 spectra of 
length 16 331. Among the 112 spectra, 64 are spectra from patients with the colorectal 
cancer (referred to as cancer hereafter) and 48 are spectra from healthy persons (referred to 
as control). Each of the 112 spectra is a high-dimensional vector of 16 331 dimensions which 
covers the mass-to-charge (m/z) ratios from 960 to 11 163 Da. For further reading, the dataset 
is presented in detail and analyzed in a supervised classification framework in |3J. 
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Method 


iris 


wine 


chironomus 


zoo 


glass 


satimage 


usps358 


DLMrv R 1 


94.8±2.3 


96.1±0.0 


91.7±5.2 




39.5±1.8 


64.6±2.2 


77.9±7.1 


DLMrs.sl 

L-^fcPJ 


96.7±0.0 


95.5±0.0 


97.2±0.1 




39.9±1.4 


65.7±0.8 


70.0±8.5 




81.9±2.4 


94.1±1.3 


91.8±2.4 


73.3±5.5 


40.6±0.9 


62.7±1.9 


74.1±9.4 


DLMry«i 

[Zjp J 


77.8±3.7 


93.6±1.6 


89.1±6.3 


78.4±6.4 


38.5±1.9 


68.0±1.7 


66.4±8.7 




89.3±0.0 


95.5±0.0 


86.1±6.3 


73.7±3.5 


42.0±2.2 


65.5±2.0 


74.8±9.1 




91.1±1.4 


94.2±0.2 


96.3±7.0 


70.4±5.3 


40.1±3.3 


65.0±2.9 


68.7±11.1 




96.1±2.2 


95.5±0.0 


87.5±3.9 


73.7±3.6 


39.2±3.7 


64.4±2.1 


76.2±7.6 




98.0±0.0 


94.3±0.0 


96.2±6.8 


72.8±3.1 


40.1±2.0 


58.9±5.3 


74.1±10.6 




79.3±3.6 


93.8±2.8 


83.7±3.9 


72.5±7.0 


39.4±0.9 


62.4±1.8 


77.8±8.2 




72.7±6.5 


92.6±3.2 


89.7±6.3 


80.1±4.2 


39.5±1.5 


68.0±1.5 


74.2±11.2 




80.3±4.3 


96.3±1.9 


83.6±8.5 


70.2±7.0 


39.1±2.4 


62.4±2.5 


81.2±6.5 




79.8±4.0 


97.1±0.0 


89.8±6.6 


78.0±4.8 


38.4±1.3 


67.9±1.3 


72.8±9.8 


Full-GMM 


79.0±5.7 


60.9±7.7 


44.8±4.1 




38.3±2.1 


35.9±3.1 




Com-GMM 


57.6±18.3 


61.0±14.9 


51.9±10.9 


59.9±10.3 


38.3±3.1 


26.1±1.5 


38.2±1.1 


Mixt-PPCA 


89.1±4.2 


63.1±7.9 


56.3±4.5 


50.9±6.5 


37.0±2.3 


40.6±4.7 


53.1±9.6 


Diag-GMM 


93.5±1.3 


94.6±2.8 


92.1±4.2 


70.9±12.3 


39.1±2.4 


60.8±5.2 


45.9±9.1 


Sphe-GMM 


89.4±0.4 


96.6±0.0 


85.9±9.9 


69.4±5.4 


37.0±2.1 


60.2±7.5 


78.7±11.2 


PCA-EM 


66.9±9.9 


64.4±5.7 


66.1±4.0 


61.9±6.2 


39. Oil. 7 


56.2±4.2 


67.6±11.2 


k-means 


88.7±4.0 


95.9±4.0 


92.9±6.0 


68.0±7.4 


41.3±2.8 


66.6±4.1 


74.9±13.9 


Mclust 


96.7 


97.1 


97.9 


65.3 


41.6 


58.7 


55.5 


Model name 


(VEV) 


(VVI) 


(EEE) 


(EII) 


(VEV) 


(VVV) 


(EEE) 



Table 5: Clustering accuracies and their standard deviations (in percentage) on the UCI 
datasets averaged on 20 trials. No standard deviation is reported for Mclust since its initial- 
ization procedure is deterministic and always provides the same initial partition. 



Method 


wine 


iris 


chironomus 


zoo 


glass 


satimage 


usps358 


PCA-k-means [16 


70.2 


88.7 




79.2 


47.2 






LDA-k-means [TCj 


82.6 


98.0 




84.2 


51.0 






Dis-k-means [53] 












65.1 




DisCluster [53] 












64.2 




HMFA [43| 






98.7 











Table 6: Clustering accuracies (in percentage) on UCI datasets found in the literature (these 
results have been obtained with slightly different experimental setups). 
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Figure 8: Estimated mean spectra of the cancer class (up) and of the control class (bottom) 
on the m/z interval 900-3500 Da. 



Following the experimental protocol of [3], Fisher-EM was applied on the 6 168 dimen- 
sions corresponding to m/z ratios between 960 and 3 500 Da since there is no discriminative 
information on the reminder. Figure [8] shows the mean spectra of the cancer and control 
classes estimated by Fisher-EM on the m/z interval 900-3500 Da. To be able to compare 
the clustering results of Fisher-EM, PCA-EM and mixture of PPCA (Mixt-PPCA) have been 
applied to this subset as well. It has been asked to all methods to cluster the dataset into 2 
groups. It is important to remark that this clustering problem is an <^ p problem and, among 
the model-based methods, only these three methods are able to deal with it (see Section . 



6.2 Experimental results 

Table [7] presents the confusion tables computed from the clustering results of PCA-EM, mix- 
ture of PPCA and Fisher-EM. On the one hand, PCA-EM has selected d = 4 principal axes 
with the 90% variance rule before to cluster the data in this subspace and mixture of PPCA 
has selected d = 2 principal axes for each group. On the other hand, Fisher-EM has estimated 
the discriminative latent subspace with d = K — 1 = \ axis to cluster this high-dimensional 
dataset. It first appears that PCA-EM and mixture of PPCA provide satisfying clustering re- 
sults on such a complex dataset. However, it is disappointing to see that the PCA-EM make a 
significant number of false negatives (cancers classified as non-cancers) since the classification 
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PCA-EM 




Cluster 


Class 


Cancer Control 


Cancer 


48 16 


Control 


1 47 


Misclassification rate = 0.15 



Mixt-PPCA 




Cluster 


Class 


Cancer 


Control 


Cancer 


62 


2 


Control 


10 


38 


Misclassification rate 


= 0.11 



Fisher-EM 




Cluster 




Class 


Cancer Control 


Cancer 


57 


7 


Control 


3 


45 


Misclassification rate = 


- 0.09 



Table 7: Confusion tables for PCA-EM (left), mixture of PPCA (center) and Fisher-EM 
(right). 



risk is not symmetric here. Conversely, mixture of PPCA and Fisher-EM provide a better 
clustering results both from a global point of view (respectively 89% and 91% of clustering 
accuracy) and from a medical point of view since Fisher-EM makes significantly less false 
negatives with an acceptable number of false positives. 

More importantly, Fisher-EM provides information which can be interpreted a posteriori 
to better understand both the data and the phenomenon. Indeed, the values of the estimated 
loading matrix U , which is a 6 168 x 1 matrix here, expressed the correlation between the 
discriminative subspace and the original variables. It is therefore possible to identify the 
original variables with the highest power of discrimination. It is important to highlight that 
Fisher-EM extracts this information from the data in a unsupervised framework. Figure [9] 
shows the correlation between each original variable and the discriminative subspace on an 
arbitrary scale. The peaks of this curve correspond to the original variables which have a high 
correlation with the discriminative axis estimated by Fisher-EM. 

Figure [To] plots the difference between the mean spectra of the classes cancer and control 
(cancer - control) and indicates as well, using red triangles, the most discriminative original 
variables (m/z values). It is not surprising to see that original variables where the cancer and 
control spectra have a big difference are among the most discriminative. More surprisingly, 
Fisher-EM selects the original variables with m/z values equal to 2800 and 3050 as discrim- 
inative variables whereas the difference between cancer and control spectra is less for these 
variables than the difference on the variable with m/z value equal to 1350. Such information, 
which have extracted from the data in a unsupervised framework, may help the practitioner 
to understand the clustering results. 



7 Conclusion and further works 

This work has presented a discriminative latent mixture model which models the data in a 
latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimen- 
sion of the original space. A family of 12 parsimonious DLM models has been exhibited by 
constraining model parameters within and between groups. An estimation algorithm, called 
the Fisher-EM algorithm, has been also proposed for estimating both the mixture parameters 
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Figure 9: Discrimination power of the original variables: correlation between original variables 
and the discriminative subspace on an arbitrary scale. 
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Figure 10: Difference between the mean spectra of the classes cancer and control (cancer 
control) and most discriminative variables (indicated by red triangles). 
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and the latent discriminative subspace. The determination procedure for the discriminative 
subspace adapts the well-known Fisher criterion to the unsupervised classification context 
under an orthonormality constraint. Furthermore, when the number of groups is not too 
large, the estimated discriminative subspace allows a useful projection of the clustered data. 
Experiments on simulated and real datasets have shown that Fisher-EM performs better than 
existing clustering methods. The Fisher-EM algorithm has been also applied to the clustering 
of mass spectrometry data, which is a real-world and complex application. In this specific 
context, Fisher-EM has shown its ability to both efficiently cluster high-dimensional mass 
spectrometry data and give a pertinent interpretation of the results. 

However, the convergence of the Fisher-EM algorithm has been proved in this work only 
for 2 of the DLM models and the convergence for other models should be investigated. We feel 
that the convergence could be proved for these models at least in a generalized EM context. 
Among the other possible extensions of this work, it could be interesting to find a way to 
visualize in 2D or 3D the clustered data when the estimated discriminative subspace has 
more than 4 dimensions. Another extension could be to consider a kernel version of Fisher- 
EM. For this, it would be necessary to replace the Gram matrix introduced in Section 14.61 
by a kernel. Finally, it could be also interesting to introduce sparsity in the loading matrix 
through a ii penalty in order to ease the interpretation of the discriminative axes. 
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A Appendix 

In order not to surcharge the notations, the index q of the current iteration of the Fisher-EM 
algorithm is not indicated in the following proofs. We also define the matrices W and W such 
that = -|- W. The matrix W is defined as a, p x p matrix containing the d first vectors 
of W completed by zeros such a,s W = [U, Op-d] and W = W — W is defined by I^ = [0^, V]. 

A.l E step 

Proof of PropositionUl The conditional expectation tik = E[P{zik\yi,@)] can be viewed as 
well as the posterior probability of the observation yj given a group k and, thanks to the 
Bayes' formula, can be written: 
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where <f> is the Gaussian density, and VTfc and 9k are the parameters of the A;th mixture compo- 
nent estimated in the previous iteration. This posterior probabihty tik can also be formulated 
from the cost function Tk such that: 

E/=iexp(i(rfe(yi) -r,(yi)))' ^ ^ 

where Tk{yi) = —2\og{TTk(t>{yi,0k))- According to the assumptions of the model DLMpj.;3j.] 
and given that W = W + W ^Vy^ can be reformulated as: 

Tk{yi)= {y^-mk)'WA^^W\yi-mk) + {v^-mkYW^]^^W\yi-mk) ^^^^ 
+ log(|Afe|) - 21og(7rfe) +p log(27r), 

Moreover, since the relations = W and = W hold due to the construction 

of W and W, then: 

^kiVi) = {wW\yi - mk))' VFA-^W^* {ww\yi - m^)) 

+ ^ {WW\yi - mfc))* {WW\yi - m^)) (A.4) 

Pk 

+ log(|Afe|)-21og(7rfe)+p log(27r). 

Let us now define = W/^-^^W^ and ||.||^^, a norm on the latent space spanned by W , such 
that \\y\\^^ = y^'&kV- With these notations, and according to the definition of A^, can be 
rewritten as: 



TkiVi) = \\WW\yi - mk)\\l^ + —\\WW\yi - mk)\\ 



1 

Pk"" " (A.5) 
+ log(|Sfc|) + {p-d) log(;3jfc) - 21og(7rjk) +p log(27r). 



Let us also define the projection operators P and P on the subspaces E and E respectively: 

• P{y) = WW^y is the projection of y on the discriminative space E, 

• P^{y) = WW^y is the projection of y on the complementary space E"*-. 
Consequently, the cost function can be finally reformulated as: 

1 



^k{yi)= \\P{yi-rnk)\\^^ + P iyi-rrik) 

+ log(|Sjk|) + {p-d) log(;3jk) - 21og(7rjt) +p log(27r). 



(A.6) 



Since P'^{y) = y — P{y), then the distance associated with the complementary subspacc can 
be rewritten as — mfc)|p = — mk) — P{yi — and this allow to conclude. □ 
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A. 2 M step 

Proof of Proposition\^ In the case of the model DLMp^.^^.], at iteration g, the conditional ex- 
pectation of the complete log-likelihood Q{yi, ■ ■ ■ , yn-,G\6^'^~'^^) of the observed data {yi, . . . , yn\ 
has the following form: 



n K 



^=l k=l ^ ^ 

= ^^tik[-^log{\Sk\)--{yi-mkYSf:'^{yi-mk) + log{TTk)-^ log(27r) , 

i=l k=l 

where tik = E[zik\9^''^^^- According to the definitions of the diagonal matrix and of the 
orientation matrix W for which W^'^ = W^, the inverse covariance matrix of Y can be 
written as 5^^ = (WAkW^)'^ = W'^Ai^^W'^ = T^A^^M^* and the determinant of Sk can 
be also reformulated in the following way: 

|5fc| = lAfcl = (A.8) 

Consequently, the complete log-likelihood Q{9) can be rewritten as: 

K 



=-^X^nfe[-21og(7rfc)+log(|Efc|) + (p-d) \og{/3k) 

k=l 

1 " 1 

+ — V tikiyi - mkYWA-^W\yi - m^) + 7 . 



(A.9) 



nk . , 
1=1 



where = X^^Li ^ik and 7 = plog(27r) is a constant term. At this point, two remarks can be 
done on the quantity X^iLi ^ikiyi — n^kYW A'j^^W^ {yi — nik)- First, as this quantity is a scalar, 
it is equal to its trace. Secondly, this quantity can be divided in two parts since W = [U, V] 
andW = W + W. Then, the relation VFA^^I^* = WA'^^W^ + WA^^W^ is stated and we 
can write: 

{yi-mk)^WA^^W\yi-mk) = trace (^{yi - mkYW A^^W\yi - nik)) 

+ trace ((y^ - mkfW A~^W\yi - nik)) . 

Moreover, pointing out that Ck = X^ILi ^ik{yi ~ 'rnk){yi — ""T-fc)* is the empirical covariance 
matrix the fcth group, the previous quantity can be rewritten as: 

1 " 

— ^i^^y^ - mkYWA~^W\yi - ruk) = tTace{A-^W'CkW) + tTace{A^'W'CkW) (A.ll) 
i=i 
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and finally: 

1 " ^~'^v^CkV- 
— Y,tik{v^-^kfWl^l^W\yi-mk)= trace(S^iC/*CfcC/) + ^Z^^' (^'l^) 

"^k . , . , Pk 

1=1 3=1 

where Vj, is the jth column vector of V . However, since W = W — W and W = [U, V], it is 
also possible to write: 

^ p-d 1 / ^ 

i=i \i=i i=i 

=^ ( E trace(t«j^C7,) - ^ u'^CkUj ) (A.13) 

= — trace(Cfc) - ^MjCfeUj . 
/^fc i=i 

Consequently, replacing this quantity in (|A.9P provides the final expression of Q{9). □ 

Proof of Proposition\^ The maximization of conduces for the DLM models to the fol- 
lowing estimates. 



Estimation of VTk The prior probability VTfc of the group k can be estimated by maximizing 
Q{e) with 
function: 



Q{9) with respect to the constraint X^^i '^k = ^ which is equivalent to maximize the Lagrange 



L = Q{e) + \[^^T,k-l^, (A.14) 

where A is the Lagrange multiplier. Then, the partial derivative of L with respect to vr^ is 
^ = ^ + A. Consequently: 

'ik = l,...,K ^ = + X = nk + XiTk = 0, (A.15) 

and: 

K 

^(nfc + AvTfc) = n + A = ^ A = -n. (A. 16) 

k=l 

Replacing A by its value in the partial derivative conduces to an estimation of VTfc by: 

nk = ^. (A.17) 
n 
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Estimation of //k The mean Hk of the kth group in the latent space can be also estimated 
by maximizing the expectation of the complete log-likelihood (equation lA.Tp . which can be 
written in the following way: 



n K 



^(^) = E Z [-^ log(I5fcl) -^iVi- UfikyS^\y, - C//i,) + log(7r,) - | log(27r)] . (A.18) 



i=l k=l 



Consequently, the partial derivative of Q with respect to is = ~\Y^=i^ikU* {Ui 

Ufik)- Setting this quantity to gives: 



n 



= ^ j2*^kU% = ^ti,f,k. (A.19) 

^^1" t=l i=l 

and conduces to: 

1 " 

fik = —Y^UkU'vi. (A.20) 

1=1 

Model DLMpj,^;,] From Equation (|4.7p . the partial derivative of Q{6) with respect to Sfc 
has the following form: 

^ = -y ^ [log(lS.I) + trace {^-,'U'C,U)] . (A.21) 

Using the matrix derivative formula of the logarithm of a determinant, ^'"f^"^^^ = (^^^)*> 
and of the trace of a product, — — = — {^A^^BA^^Y , the equality of ^q^^J to the dx d 
zero matrix yields to the relation: 

= ^^^U'CkU^^\ (A.22) 
and, by multiplying on the left and on the right by S^, we find out the estimate of S^: 

±k = U'CkU. (A.23) 
The estimation of /3fc is also obtained by maximizing Q subject to /3fc: 

_n ^ ^"^ trace(Cfc) 1 V^^f^ „ _„ ( a oa) 

^-'^—k W~ MU ' " 

and it is possible to conclude: 

trace(Cfe) - Yfj=i upkUj 

f3k = j . A. 25 

p — a 

Model DLMpj./?] In this case, Q has the following form: 
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fc=i 

K d 

+ ""kip - d) log(/3) + [trace(Cfc) - ^ ^C^^j 

fc=i fc=i ^ j=i 

1 ^ 

= - 2 (E [-2 log(7rfc) + trace(S^iC/*Cfc^7) + log(|Sfcj) + 7 

1 



(A.26) 



k=l 

n{p-d) log(/3) 



/3 



trace(C) — n'^^UjCuj 



where C is the soft within covariance matrix of the whole dataset. Setting to the partial 
derivative of 
conduces to: 



derivative of Q{0) conditionally to /3 implies — ■^trace(C) + -p- Yl'j=i u^jCuj = and this 



/3 



P 



— - trace(C) — > u\Cu^ 



and the estimation of is given by Equation ()A.23p . 



Model DLMp/3j.] The quantity Q can be rewritten in this manner: 



(A.27) 



+ n log(|S|) +n trace(S^^f/*C7[/) 



Q(^)=-\(Sl^k [-21og(7r,, 

k=l 

+ Ynk \^ip - d) log(/3fc) + — trace(Cfc) - ^ u'-CkUj + 



7 



then, the partial derivative of Q{9) with respect to S is: 

dQ{e) n d 



5S 2 5S 

and setting to provides the estimation of S: 



[log(|E|) + trace {^-^U^CU)] 



(A.28) 



(A.29) 



(A.30) 



Finally, the estimation of is provided by Equation (IA.25p . 

Model DLMp^] The estimations of S and /3 have been already considered above and are 
given by Equations (jXSOl and 1X27]) . 
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Model DLM^^^.p^j In this case, Q has the fohowing form: 

Qie) = --^nk\-2log{7rk)+Y,{^ogiakj) + ^ -]+ip-d) log(/3fc)+- Yl ^PkV,+^ 

^ k=i j=i \ J j=d+i 

(A.31) 

The partial derivative of Q with respect to a^j is = _^J_ / ^ i ^ | and setting 

to provides the estimate of a^j: 

Q^fcj = u^jCkUj. (A. 32) 

The estimation of /J^ is provided by Equation ()A.25p . 

Model DLM[q^^,^] The estimations of a^j and /3 have been aheady considered above and 
are given by Equations (fAT32l and 1X27]) . 

Model DLM[„^^^.] For this model, the expectation of the complete log-likelihood Q{6) has 
the following form: 

k=l j=l \ oik J 

+ {p-d) log(/3fc) + yY^ VjCkVj + 7] , 

Q{0) = -i;Ynk\-2 log(7rfc) + d log(afc) + — X] ^^^kUj 
^ k=i j=i 
^ p-d 

+ {p-d) log(/3fc) + -rY^ vjCkVj + 7 

The partial derivative of Q{B) with respect to is = ~2n^ (c^ ~ ^ Sj=i ''^j'^fc^^j^ > 
and setting this quantity to 0, provides: 

1 

ak = -Yu]CkUj. (A.34) 

On the other hand, the estimation of is the same as in Equation ()A.25p . 

Model DLMjq,^^] The estimations of at and /3 are respectively provided by Equations (IA.34P 
and dAHl). 



(A.33) 
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Model DLM[fj^.^^] In this case, Q{9) has the fohowing form: 



QiO) 



" k=l 3=1 \ "-^ / 

1 ^"'^ \ 

+ {p-d) log(/3fc) + ^j'^'^^J + V ' 

fc=i 1=1 1=1 ^ 



p—d 



j=i 



7 



k=l 



(A.35) 



The partial derivative of Q{9) with respect to aj is = — § — ^u* Cnj^ and setting 

to imphes: 



and the estimation of is the same as in Equation (|A.25p . 



(A.36) 



Model DLMjq^,^] The estimations of aj and /3 are respectively provided by Equations (|A.36P 
and (1X271) . 



Model DLM[q^^] In this case, Q{9) has the following form: 



1 ^ 1 
Qi0) = - - ^nfc(-21og(7rfe) + d log(a) + - ^u*Cfcnj+ 

fe=i i=i 



p—d 



f - 

+ {p-d) log(/3fc) + ^ X] ^i'^'^^i + 



Q{0) = - -(^Y,nk[-2log{TTk)]+n d log(a) + ^Y.u'jCuj 
k=i " 1=1 



X ^p-d 

Uk \{p - d) log(/3fc) + VjCkVj + 

P'' i=i 



fc=i 



8Q(e) 



■7 



(A.37) 



The partial derivative of Q{B) with respect to a is — 
setting this quantity to 0, we end up with: 



d _ 1 ^d 

2 \a 1? 



Y.]=xU^jCu^ , and 



a 



1 



(A.38) 
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The estimation of is the same as in Equation ()A.25p . 



Model DLM[q,^] The estimations of a and (5 have been aheady computed and are provided 
by Equations (jXSS]) and (jX22l). □ 
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